Introduction
This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Basic R concepts and commands |
| 2. | Visualizing data |
| 3. | Transforming data |
| 4. | Exploring data (EDA) |
| 5. | Creating and using tibbles |
| 6. | Tidying data |
Course coordinates
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Winter 2018/2019: Mondays, 13:30–15:00, C511.
- Links to current course syllabus | ZeUS | Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Essential commands | Data science for psychologists
## 2018 07 06
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Visualize data and EDA: ggplot2 and dplyr -----
# ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. -----
Visualizing data
In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.
See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA), as well as the links provided below, for more detailed information.
Commands and examples
General structure of ggplot calls
A generic template for creating a graph with ggplot is:
# Generic ggplot template:
ggplot(data = <DATA>) +
<GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
<FACET_fun> + # optional
<LOOK_GOOD_fun> # optional
# Minimal ggplot template:
ggplot(<DATA>) +
<GEOM_fun>(aes(<MAPPING>))
The generic template includes the following parts:
- <DATA> is a data frame or tibble that contains the data that is to be plotted.
- <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)
- A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
  - in the aesthetic mapping (when varying visual features according to data properties), or
  - by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
- An optional <FACET_fun> splits a complex plot into multiple subplots.
- A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
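As a concrete illustration of the generic template, the following sketch fills each placeholder with one possible choice. The variables `displ` and `drv` and the object name `p_template` are picked here for illustration only; any other columns of `mpg` would work the same way.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_fun> = geom_point, <MAPPING> = x and y positions
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),    # aesthetic mapping (varies with the data)
             color = "steelblue", alpha = 1/2) +   # constant arguments (same for all points)
  facet_wrap(~drv) +                               # optional <FACET_fun>: 1 subplot per drive type
  labs(title = "Highway mileage by engine displacement",
       x = "Engine displacement (litres)",
       y = "Highway (miles per gallon)") +         # optional <LOOK_GOOD_fun>: labels
  theme_bw()                                       # optional <LOOK_GOOD_fun>: theme
p_template
```

Removing the last 3 components (facets, labels, theme) leaves a plot that corresponds to the minimal template.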
Some examples that illustrate the use of these components are:
A histogram
A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:
library(ggplot2)
# Data: ------
## Using mpg data:
# ?ggplot2::mpg
# mpg
# (A) Histogram: ------
# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(binwidth = 1)
hi1
# The same histogram:
hi1b <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1) # set mappings for THIS geom
hi1b
# (B) Adding aesthetics, labels and themes: ------
# Enhanced version of the same plot:
hi2 <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
labs(title = "Distribution of fuel economy in city environments",
x = "cty (miles per gallon)",
caption = "Data from ggplot2::mpg") +
theme_light()
hi2
A scatterplot
A scatterplot shows each data point (observation) at a position determined by 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:
# (A) Scatterplot: ------
# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline()
sp1
Dealing with overplotting
A common issue with scatterplots is so-called overplotting: Multiple points appear on the same position.
Here are some ways of dealing with this issue:
- jitter adds randomness to positions;
- alpha uses transparency to show frequency of positions;
- size allows mapping values (e.g., frequency) to object size (e.g., via geom_count);
- facet_wrap allows disentangling plots by levels of variables.
Some examples include:
## Dealing with overplotting: -----
# 1. One way of dealing with overplotting is
# adding randomness to point positions:
sp2 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "jitter") +
geom_abline()
sp2
# 2. Another way of dealing with overplotting is
# using transparency (via setting alpha to < 1):
sp3 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "identity",
pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
geom_abline(linetype = 2, color = "firebrick") # +
# geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3
# Adding labels and themes to plots:
sp4 <- sp3 + # use the plot defined above
labs(title = "Fuel economy on highway vs. city",
x = "City (miles per gallon)",
y = "Highway (miles per gallon)",
caption = "Data from ggplot2::mpg") +
# coord_fixed() +
theme_bw()
sp4
# (C) Grouping (by a categorical variable): ------
# Using facets to avoid overplotting:
sp5 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline() +
facet_wrap(~class) +
theme_bw()
sp5
# Grouping by color:
sp6 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy, color = class),
position = "jitter", alpha = 1/2, size = 4) +
geom_abline(linetype = 2) +
theme_bw()
sp6
# Grouping by facets:
sp7 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy),
position = "jitter", alpha = 1/2, size = 2) +
geom_abline(linetype = 2) +
facet_wrap(~class) +
theme_bw()
sp7
See https://ggplot2.tidyverse.org/reference/ for more examples.
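The examples above cover jittering, transparency, and facets, but not the option of mapping frequency to object size. A minimal sketch of that approach, using `geom_count()` from ggplot2 (the object name `sp_count` is chosen here for illustration):

```r
library(ggplot2)

# geom_count() counts the observations at each (x, y) position
# and maps this count (n) to the size of the point:
sp_count <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), color = "steelblue", alpha = 1/2) +
  geom_abline(linetype = 2) +
  theme_bw()
sp_count
```

In contrast to jittering, this keeps every point at its true position, at the cost of an additional legend for point size.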
Note some details:
- ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.
- The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.
- Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).
- When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).
- In ggplot, a sequence of commands is combined by +, rather than %>%.
- The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).
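Two of these details, a common mapping in the initial call and the order of layers, can be seen in a short sketch. The choice of `geom_smooth()` with a linear model as the second geom is only an example and not taken from the text above:

```r
library(ggplot2)

# The common mapping is set once in ggplot() and inherited by both geoms:
p_shared <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1/4) +                       # 1st layer: drawn at the bottom
  geom_smooth(method = "lm", formula = y ~ x,
              color = "firebrick")                # 2nd layer: drawn on top of the points
p_shared
```

Swapping the two geom lines would draw the points on top of the trend line instead.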
EDA
Creating good graphs is both an art and a craft, and requires answering 2 sets of questions:
1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like
- How many variables are there to plot?
- Are these variables categorical or continuous?
- Do some variables qualify (e.g., group) the values of others?
2. Knowing the intended type of plot. This includes answering functional questions like
- What is the purpose of this plot?
- What are possible plots for this purpose?
- Which of these would be the most appropriate plot?
Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.
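The data-related questions can partly be answered by inspecting the data before plotting. The following is a rough sketch using base R functions on the `mpg` data; the object names `var_types` and `n_distinct_values` are made up for this example:

```r
library(ggplot2)  # only needed for the mpg data

# How many variables are there, and of which type?
var_types <- vapply(mpg, function(v) class(v)[1], character(1))
var_types
table(var_types)

# Which variables have only a few distinct values
# (and may qualify or group the values of others)?
n_distinct_values <- vapply(mpg, function(v) length(unique(v)), integer(1))
sort(n_distinct_values)
```

Variables with few distinct values (e.g., `drv` or `cyl`) are candidates for grouping by color or facets, whereas variables with many distinct numeric values are candidates for the x- or y-axis.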
Basic plot types
Histograms
A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:
library(ggplot2)
library(tibble) # provides tibble() and as_tibble()
# Create data:
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
# Basic histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5)
# Pimped histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5,
fill = "gold", color = "black") +
labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
caption = "[Using random iq data.]") +
theme_classic()
More on histograms:
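The impression given by a histogram depends strongly on the choice of `binwidth`. As a small sketch (using the `cty` variable of `mpg` rather than the random `iq` data, so that the result does not depend on a random seed), the bins that are actually computed can be inspected with `layer_data()`:

```r
library(ggplot2)

# Same data, 2 different bin widths:
h_fine   <- ggplot(mpg, aes(x = cty)) + geom_histogram(binwidth = 1)
h_coarse <- ggplot(mpg, aes(x = cty)) + geom_histogram(binwidth = 5)

# layer_data() returns the data computed for a layer (1 row per bin):
n_fine   <- nrow(layer_data(h_fine, 1))
n_coarse <- nrow(layer_data(h_coarse, 1))
c(fine = n_fine, coarse = n_coarse)

# The counts of all bins add up to the number of observations:
sum(layer_data(h_fine, 1)$count)
```

A narrow bin width shows more detail but also more noise, so trying several values is usually worthwhile.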
Scatterplots
A scatterplot shows the relationship between 2 (typically continuous) variables:
# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
# Basic scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))
# Using 3 different facets:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
facet_wrap(~Species)
# Pimped scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), pch = 21, color = "black", size = 2, alpha = 1/2) +
facet_wrap(~Species) +
# coord_fixed() +
labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
caption = "[Using iris data.]") +
theme_bw() +
theme(legend.position = "none")
More on scatterplots:
Bar plots
Another common type of plot shows values (across different levels of some variable) as the height of bars. As this plot type can use either categorical or continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:
Counts of cases
By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):
library(ggplot2)
## Data:
ggplot2::mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl cla…
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p com…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p com…
#> 3 audi a4 2 2008 4 manu… f 20 31 p com…
#> 4 audi a4 2 2008 4 auto… f 21 30 p com…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p com…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p com…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p com…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p com…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p com…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p com…
#> # ... with 224 more rows
# (a) Count number of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class))
# (b) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..))
# (c) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class), stat = "count")
# (d) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..), stat = "count")
# (e) pimped version:
ggplot(mpg) +
geom_bar(aes(x = class, fill = class),
# stat = "count",
color = "black") +
labs(title = "Counts of cars by class",
x = "Class of car", y = "Frequency") +
scale_fill_brewer(name = "Class:", palette = "Blues") +
theme_bw()
Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).
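The `..count..` notation used above still works, but more recent versions of ggplot2 offer `after_stat()` for referring to computed variables. A sketch, assuming ggplot2 version 3.3.0 or later (the object names `bar_default` and `bar_after` are chosen for illustration):

```r
library(ggplot2)

# Default: geom_bar() counts cases and maps the count to y
bar_default <- ggplot(mpg) +
  geom_bar(aes(x = class))

# The same mapping made explicit with after_stat():
bar_after <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))

# Both layers contain the same computed counts:
all.equal(layer_data(bar_default, 1)$count,
          layer_data(bar_after, 1)$count)
```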
Proportion of cases
An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:
library(ggplot2)
# (1) Proportion of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..prop.., group = 1))
# is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count../sum(..count..)))
Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).
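Instead of letting `geom_bar` compute proportions, they can also be computed explicitly before plotting. The following sketch uses `count()` and `mutate()` from dplyr; the names `class_props` and `prop` are made up for this example:

```r
library(ggplot2)
library(dplyr)

# Compute the proportions in a summary table first:
class_props <- mpg %>%
  count(class) %>%              # 1 row per class with its frequency n
  mutate(prop = n / sum(n))     # proportion of all cases
class_props

# Then plot the existing values (nothing left to compute):
ggplot(class_props) +
  geom_bar(aes(x = class, y = prop), stat = "identity")
```

The summary table has the additional benefit that the plotted values can be checked (e.g., that all proportions add up to 1).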
Bar plots of existing values
A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").
For instance, let’s plot a bar chart that shows the election data from the following tibble de:
| year | party | share |
|---|---|---|
| 2013 | CDU/CSU | 0.415 |
| 2013 | SPD | 0.257 |
| 2013 | Others | 0.328 |
| 2017 | CDU/CSU | 0.330 |
| 2017 | SPD | 0.205 |
| 2017 | Others | 0.465 |
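The code that created `de` is not shown here. One way in which such a tibble could be constructed is sketched below. The order of the factor levels is an assumption, chosen so that the 3 parties appear in the order of the table:

```r
library(tibble)
library(ggplot2)

# Election data as a tibble (year as character, party as factor):
de <- tibble(
  year  = rep(c("2013", "2017"), each = 3),
  party = factor(rep(c("CDU/CSU", "SPD", "Others"), times = 2),
                 levels = c("CDU/CSU", "SPD", "Others")),  # assumed level order
  share = c(0.415, 0.257, 0.328, 0.330, 0.205, 0.465)
)
de

# geom_col() is a shortcut for geom_bar(stat = "identity"):
ggplot(de, aes(x = year, y = share, fill = party)) +
  geom_col(position = "dodge")
```

As `geom_col()` uses `stat = "identity"` by default, it is a convenient alternative whenever the values to plot already exist in the data.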
- A version with 2 x 3 separate bars (using position = "dodge"):
## Data: -----
de # => 6 x 3 tibble
#> # A tibble: 6 x 3
#> year party share
#> <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.33
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)
## (1) Bar chart with side-by-side bars (dodge): -----
## (a) minimal version:
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (A) 3 bars per election (position = "dodge"):
geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1
## (b) Version with text labels and customized colors:
bp_1 +
## pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01),
position = position_dodge(width = 1),
fontface = 2, color = "black") +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_bw()
- A version with 2 bars with 3 segments (using position = "stack"):
## (2) Bar chart with stacked bars: -----
## (a) minimal version:
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (B) 1 bar per election (position = "stack"):
geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2
## (b) Version with text labels and customized colors:
bp_2 +
## Pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%")),
position = position_stack(vjust = .5),
color = rep(c("black", "white", "white"), 2),
fontface = 2) +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_classic()
Bar plots with error bars
It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, confidence interval, etc.) to any bar plot. There is an entire range of geoms that draw error bars:
## Create data to plot: -----
n_cat <- 6
set.seed(101)
data <- tibble(
name = LETTERS[1:n_cat],
value = sample(seq(25, 50), n_cat),
sd = abs(rnorm(n = n_cat, mean = 0, sd = 8))) # abs(), as a standard deviation cannot be negative
data
#> # A tibble: 6 x 3
#> name value sd
#> <chr> <int> <dbl>
#> 1 A 34 1.71
#> 2 B 26 2.49
#> 3 C 42 9.39
#> 4 D 40 4.95
#> 5 E 30 0.902
#> 6 F 31 7.34
## Error bars: -----
## x-aesthetic only:
# (a) errorbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd),
width = 0.4, color = "orange", alpha = 1, size = 1.0)
# (b) linerange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd),
color = "firebrick", alpha = 1, size = 2.5)
## Additional y-aesthetic:
# (c) crossbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
width = 0.3, color = "sienna1", alpha = 1, size = 1.0)
# (d) pointrange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
color = "gold", alpha = 1.0, size = 1.2)
More on barplots:
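The examples above use made-up values. With real data, the bar heights and the measures of variability are typically computed from the raw observations first. A sketch using dplyr on the `mpg` data (the names `hwy_by_drv`, `mean_hwy`, `sd_hwy`, and `se_hwy` are chosen for illustration):

```r
library(ggplot2)
library(dplyr)

# Summarise raw data: mean, SD, and standard error of hwy by drive type
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(mean_hwy = mean(hwy),
            sd_hwy   = sd(hwy),
            n        = n()) %>%
  mutate(se_hwy = sd_hwy / sqrt(n))   # standard error of the mean
hwy_by_drv

# Bars show means, error bars show +/- 1 standard deviation:
ggplot(hwy_by_drv, aes(x = drv)) +
  geom_bar(aes(y = mean_hwy), stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mean_hwy - sd_hwy, ymax = mean_hwy + sd_hwy),
                width = 0.4)
```

Replacing `sd_hwy` by `se_hwy` in the error bars would show standard errors instead, so the plot's caption should state which measure is displayed.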
Drawing curves and lines
ToDo:
- adding trendlines
- lines of data (e.g., means)
Box plots
ToDo:
- show medians, quartiles, distribution, and outliers
Improving plots
Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:
- colors: can be set within geoms (variable when inside aes(...), fixed outside), or by choosing or designing specific color scales;
- labels: labs(...) allows setting titles, captions, axis labels, etc.;
- legends: can be (re-)moved or edited;
- themes: can be selected or modified.
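As a sketch of how these 4 levers can be combined, the following code first creates a default plot and then adjusts its colors, labels, legend, and theme. The particular palette and legend position are arbitrary choices for illustration:

```r
library(ggplot2)

# Default plot (color varies with data, as it is set inside aes):
p_default <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = drv))

# Improved version of the same plot:
p_improved <- p_default +
  scale_color_brewer(name = "Drive:", palette = "Set1") +   # colors: a specific color scale
  labs(title = "Highway mileage by engine displacement",
       x = "Engine displacement (litres)",
       y = "Highway (miles per gallon)",
       caption = "Data from ggplot2::mpg") +                # labels
  theme_bw() +                                              # themes: select a theme
  theme(legend.position = "bottom")                         # legends: move the legend
p_improved
```

Note that `theme(legend.position = ...)` is added after `theme_bw()`, as a complete theme would otherwise override the earlier adjustment.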
More on data visualization
- study vignette("ggplot") and the documentation for ggplot and various geoms (e.g., geom_);
- study https://ggplot2.tidyverse.org/reference/ and its examples;
- see the cheat sheet on data visualization;
- read Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and complete their exercises.
Conclusion
[Last update on 2018-11-11 09:32:46 by hn.]
This is different in Sankey diagrams, as shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.